Abstract

Research on automatically geolocating social media users has
conventionally been based on the text content of posts from a given
user or the social network of the user, with very little crossover
between the two, and no benchmarking of the two approaches over
comparable datasets. We bring the two threads of research together in
first proposing a text-based method based on adaptive grids, followed
by a hybrid network- and text-based method. Evaluating over three
Twitter datasets, we show that the empirical difference between text-
and network-based methods is not great, and that hybridisation of the
two is superior to the component methods, especially in contexts
where the user graph is not well connected. We achieve
state-of-the-art results on all three datasets.


1 Introduction

There has recently been a spike in interest in the task of inferring
the location of users of social media services, due to its utility in
applications including location-aware information retrieval (Amitay
et al., 2004), recommender systems (Noulas et al., 2012) and rapid
disaster response (Earle et al., 2010). Social media sites such as
Twitter and Facebook provide two primary means for users to declare
their location: (1) through text-based metadata fields in the user’s
profile; and (2) through GPS-based geotagging of posts and check-ins.
However, the text-based metadata is often missing, misleading or
imprecise, and only a tiny proportion of users geotag their posts
(Cheng et al., 2010). Given the small number of users with reliable
location information, there has been significant interest in the task
of automatically geolocating (predicting lat/long coordinates of)
users based on their publicly available posts, metadata and social
network information. These approaches are built on the premise that a
user’s location is evident from their posts, or through location
homophily in their social network.

Our contributions in this paper are: a) the demonstration that
network-based methods are generally superior to text-based user
geolocation methods due to their robustness; b) the proposal of a
hybrid classification method that backs off from network- to
text-based predictions for disconnected users, which we show to
achieve state-of-the-art accuracy over all Twitter datasets we
experiment with; and c) empirical evidence to suggest that text-based
geolocation methods are largely competitive with network-based
methods.

2 Related Work 

Past work on user geolocation falls broadly into two categories:
text-based and network-based methods. Common to both methods is the
manner of framing the geolocation prediction problem. Geographic
coordinates are real-valued, and accordingly this is most naturally
modelled as (multiple) regression. However, for modelling convenience,
the problem is typically simplified to classification by first
pre-partitioning the regions into discrete sub-regions using either
known city locations (Han et al., 2012; Rout et al., 2013) or a k-d
tree partitioning (Roller et al., 2012; Wing and Baldridge, 2014). In
the k-d tree methods, the resulting discrete regions are treated
either as a
flat list (as we do here) or a nested hierarchy. 
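
As a rough illustration of how a k-d tree can turn continuous
coordinates into a flat set of classes, the following Python sketch
splits a list of (lat, lon) points until no leaf holds more than a
given number of users. The function name kd_partition, the alternating
choice of split axis and the median split point are assumptions made
for this sketch; they are not taken from the cited implementations.

```python
def kd_partition(points, max_users, depth=0):
    """Split (lat, lon) points into leaves holding at most max_users points.

    Hypothetical helper: alternates between latitude and longitude and
    splits at the median, so every leaf ends up with a similar count.
    """
    if max_users < 1:
        raise ValueError("max_users must be at least 1")
    if len(points) <= max_users:
        return [points]
    axis = depth % 2  # 0 = latitude, 1 = longitude
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], max_users, depth + 1)
            + kd_partition(ordered[mid:], max_users, depth + 1))


# Each returned leaf is one class label in the flat representation.
points = [(float(i), float(i % 7)) for i in range(100)]
leaves = kd_partition(points, max_users=30)
```

Because the split follows the data, densely populated areas end up
with many small leaves and sparse areas with a few large ones.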























2.1 Text-based Geolocation 


Text-based approaches assume that language in social media is
geographically biased, which is clearly evident for regions speaking
different languages (Han et al., 2014), but is also reflected in
regional dialects and the use of region-specific terminology.
Text-based models have predominantly used bag-of-words features to
learn per-region classifiers (Roller et al., 2012; Wing and Baldridge,
2014), including feature selection for location-indicative terms
(Han et al., 2012).

Topic models have also been applied to model geospatial text usage
(Eisenstein et al., 2010; Ahmed et al., 2013), by associating latent
topics
with locations. This has a benefit of allowing for 
prediction over continuous space, i.e., without the 
need to render the problem as classification. On the 
other hand, these methods have high algorithmic 
complexity and their generative formulation is 
unlikely to rival the performance of discriminative 
methods on large datasets. 


2.2 Network-based Geolocation 

Although online social networking sites allow for global interaction,
users tend to befriend and interact with many of the same people
online as they do off-line (Rout et al., 2013). Network-based methods
exploit this property to infer the location of users from the
locations of their friends (Jurgens, 2013; Rout et al., 2013). This
relies on some form of friendship graph, through which location
information can be propagated, e.g., using label propagation (Jurgens,
2013; Talukdar and Crammer, 2009). A significant caveat regarding the
generality of these techniques is that friendship graphs are often not
accessible, e.g., secured from the public (Facebook) or hidden behind
a heavily rate-limited API (Twitter).


While the raw accuracies reported for network-based methods (e.g.,
Jurgens (2013) and Rout et al. (2013)) are generally higher than those
reported for text-based methods (e.g., Wing and Baldridge (2014) and
Han et al. (2014)), they have been evaluated over different datasets
and spatial representations, making direct comparison uninformative.
Part of our contribution in this paper is direct comparison between
the respective methods over standard datasets. In this, we propose
both text- and network-based methods, and show that they achieve
state-of-the-art results on three pre-existing Twitter geolocation
corpora. We also propose a new hybrid method incorporating both
textual and network information, which also improves over the
state-of-the-art, and outperforms the text-only or network-only
methods over two of the three datasets.

3 Data

We evaluate on three Twitter corpora, each of which uses geotagged
tweets to derive a geolocation for each user. Each user is represented
by the concatenation of their tweets, and is assumed to come from a
single location.

GeoText: around 380K tweets from 9.5K users based in the contiguous
USA, of which 1895 are held out for development and testing
(Eisenstein et al., 2010); the location of each user is set to the GPS
coordinates of their first tweet.

Twitter-US: around 39M tweets from 450K users based in the contiguous
USA. 10K users are held out for each of development and testing
(Roller et al., 2012); again users’ locations are taken from their
first tweet.

Twitter-World: around 12M English tweets from 1.4M users based around
the world, of which 10K users are held out for each of development and
testing (Han et al., 2012); users are geotagged with the centre of the
closest city to their tweets.

In each case, we use the established training, development and testing
partitions, and follow Cheng et al. (2010) and Eisenstein et al.
(2010) in evaluating based on: (1) accuracy at 161km (“Acc@161”); (2)
mean error distance, in kilometres (“Mean”); and (3) median error
distance, in kilometres (“Median”). 
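
The three measures can be sketched in a few lines of Python. This is
an illustrative sketch only: the use of the haversine great-circle
formula, the Earth radius of 6371 km, the inclusive 161 km threshold
and the names haversine_km and evaluate are all assumptions of the
sketch, as the text above does not specify how distances are computed.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def evaluate(predicted, gold, threshold_km=161.0):
    """Return Acc@161 (as a percentage), mean and median error in km."""
    errors = [haversine_km(p, g) for p, g in zip(predicted, gold)]
    within = sum(e <= threshold_km for e in errors)
    return {
        "Acc@161": 100.0 * within / len(errors),
        "Mean": mean(errors),
        "Median": median(errors),
    }


# Toy example: one exact prediction and one that is 10 degrees off.
metrics = evaluate([(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 10.0)])
```

The mean is sensitive to a few very distant predictions, whereas the
median and Acc@161 are not, which is why the three are reported
together.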












































4 Methods 


4.1 Text-based Classification

Our baseline method for text-based geolocation is based on Wing and
Baldridge (2014), who formulate the geolocation problem as
classification using k-d trees. In summary, their approach first
discretises the continuous space of geographical coordinates using a
k-d tree such that each sub-region (leaf) has similar numbers of
users. This results in many small regions for areas of high population
density and fewer larger regions for country areas with low population
density. Next, they use these regions as class labels to train a
logistic regression model (“LR”). Our model is also subject to a
sparse ℓ1 regularisation penalty (Tibshirani, 1996). In their work,
Wing and Baldridge (2014) showed that hierarchical logistic regression
with a beam search achieves higher results than logistic regression
over a flat label set, but in this research, we use a flat
representation, and leave experiments with hierarchical classification
to future work.

For our experiments, the number of users in each region was selected
from {300, 600, 900, 1200} to optimise median error distance on the
development set, resulting in values of 300, 600 and 600 for GeoText,
Twitter-US and Twitter-World, respectively. The ℓ1 regularisation
coefficient was also optimised in the same manner.

As features, we used a bag of unigrams (over both words and
@-mentions) and removed all features that occurred in fewer than 10
documents, following Wing and Baldridge (2014). The features for each
user were weighted using tf-idf, followed by per-user ℓ2
normalisation. The normalisation is particularly important because our
‘documents’ are formed from all the tweets of each user, which vary
significantly in size between users; furthermore, this adjusts for
differing degrees of lexical variation (Lee, 1995). The number of
features was almost 10K for GeoText and about 2.5M for the other two
corpora. For evaluation we use the median of all training locations in
the sub-region predicted by the classifier, from which we measure the
error against a test user’s gold standard location.

4.2 Network-based Label Propagation

Next, we consider the approach of Jurgens (2013), who used label
propagation (“LP”; Zhu and Ghahramani (2002)) to infer user locations
using social network structure. Jurgens (2013) defined an undirected
network from interactions among Twitter users based on @-mentions in
their tweets, a mechanism typically used for conversations between
friends. Consequently these links often correspond to offline
friendships, and accordingly the network will exhibit a high degree of
location homophily. The network is constructed by defining as nodes
all users in a dataset (train and test), as well as other external
users mentioned in their tweets. Unlike Jurgens (2013), who only
created edges when both users mentioned one another, we created edges
if either user mentioned the other. For the three datasets used in our
experiments, bi-directional mentions were too rare to be useful, and
we thus used the (weaker) uni-directional mentions as undirected edges
instead. The edges between users are undirected and weighted by the
number of @-mentions in tweets by either user.¹ The mention network
statistics for each of our datasets are shown in Table 1.² Following
Jurgens (2013), we ran the label propagation algorithm to update the
location of each non-training node to the weighted median of its
neighbours. This process continues iteratively until convergence,
which occurred at or before 10 iterations.
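
The mention graph and the weighted-median update used in label
propagation can be sketched as follows. This is a minimal sketch under
stated assumptions: the regular expression for @-mentions, the
lower-casing of user names, the coordinate-wise (latitude and
longitude taken separately) median, and the names build_mention_graph,
weighted_median and update_location are all hypothetical choices made
for illustration.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")  # assumed pattern for @-mentions


def build_mention_graph(tweets_by_user):
    """Map {user: [tweet, ...]} to an undirected {node: {neighbour: weight}}.

    An edge exists if either user mentions the other; its weight is the
    number of mentions in tweets by either user.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for user, tweets in tweets_by_user.items():
        user = user.lower()
        graph[user]  # users without mentions still become nodes
        for tweet in tweets:
            for mentioned in MENTION.findall(tweet.lower()):
                if mentioned != user:
                    graph[user][mentioned] += 1
                    graph[mentioned][user] += 1
    return {node: dict(nbrs) for node, nbrs in graph.items()}


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("weighted_median needs at least one value")


def update_location(node, graph, locations):
    """One update: weighted median of the located neighbours, or None."""
    located = [(locations[n], w) for n, w in graph[node].items() if n in locations]
    if not located:
        return None
    weights = [w for _, w in located]
    return tuple(weighted_median([loc[i] for loc, _ in located], weights)
                 for i in (0, 1))


tweets = {
    "alice": ["hi @bob", "@bob again @carol"],
    "bob": ["@alice yo"],
    "dave": ["no mentions here"],
}
graph = build_mention_graph(tweets)
new_location = update_location(
    "alice", graph, {"bob": (0.0, 0.0), "carol": (10.0, 10.0)})
```

In this toy example "carol" is an external user: she appears only
because "alice" mentions her, and "dave" remains an isolated node.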


¹ As our datasets don’t have tweets for external users, these nodes do
not contribute to the weight of their incident edges.

² Note that @-mentions were removed in the published Twitter-US and
Twitter-World datasets. To recover these we rebuilt the corpora from
the Twitter archive.

4.3 A Hybrid Method

Unfortunately many test users are not transitively connected to any
training node (see Table 1), meaning that LP fails to assign them any
location. This can happen when users don’t use @-mentions, or when a
set of nodes constitutes a disconnected component of the graph.

                          GeoText    Twitter-US    Twitter-World
User mentions             109K       3.63M         16.8M
Disconnected test users   23.5%      27.7%         2.36%

Table 1: The graph size and proportion of test users disconnected
from training users for each dataset.

                                     GeoText                    Twitter-US                 Twitter-World
                                     Acc@161  Mean     Median   Acc@161  Mean     Median   Acc@161  Mean     Median
LR (text-based)                      38.4     880.6    397.0    50.1     686.7    159.2    63.8     866.5    19.9
LP (network-based)                   45.1     676.2    255.7    37.4     747.8    431.5    56.2     1026.5   79.8
LP-LR (hybrid)                       50.2     653.9    151.2    50.2     620.0    157.1    59.2     903.6    53.7
Wing and Baldridge (2014) (uniform)  —        —        —        49.2     703.6    170.5    32.7     1714.6   490.0
Wing and Baldridge (2014) (k-d)      —        —        —        48.0     686.6    191.4    31.3     1669.6   509.1
Han et al. (2012)                    —        —        —        45.0     814      260      24.1     1953     646
Ahmed et al. (2013)                  ???      ???      298      —        —        —        —        —        —

Table 2: Geolocation accuracy over the three Twitter corpora comparing
Logistic Regression (LR), Label Propagation (LP) and LP over LR
initialisation (LP-LR) with the state-of-the-art methods for the
respective datasets (“—” signifies that no results were published for
the given dataset, and “???” signifies that no results were reported
for the given metric).

In order to alleviate this problem, we use the text for each test user
in order to estimate their location, which is then used as an initial
estimation during label propagation. In this hybrid approach, we first
estimate the location for each test node using the LR classifier
described above, before running label propagation over the mention
graph. This iteratively adjusts the locations based on both the known
training users and guessed test users, while simultaneously inferring
locations for the external users. In such a way, the inferred
locations of test users will better match neighbouring users in their
sub-graph, or in the case of disconnected nodes, will retain their
initial classification estimate. 
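
The hybrid procedure can be sketched as label propagation that starts
from the text-based estimates. This is an illustrative sketch rather
than the implementation used in the experiments: the in-place update
order, the coordinate-wise weighted median, the default of 10
iterations as a cap, and the names hybrid_lp_lr and text_predictions
are assumptions of the sketch.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("weighted_median needs at least one value")


def hybrid_lp_lr(graph, train_locations, text_predictions, max_iterations=10):
    """Label propagation initialised with text-based (LR) estimates.

    graph: {node: {neighbour: weight}}, undirected.
    train_locations: {node: (lat, lon)}, never changed.
    text_predictions: {test node: (lat, lon)}, used as starting points.
    """
    locations = dict(text_predictions)
    locations.update(train_locations)  # gold labels take precedence
    for _ in range(max_iterations):
        changed = False
        for node, neighbours in graph.items():
            if node in train_locations:
                continue
            located = [(locations[n], w) for n, w in neighbours.items()
                       if n in locations]
            if not located:
                continue  # isolated node keeps its text-based estimate
            weights = [w for _, w in located]
            new = tuple(weighted_median([loc[i] for loc, _ in located], weights)
                        for i in (0, 1))
            if locations.get(node) != new:
                locations[node] = new
                changed = True
        if not changed:
            break
    return locations


graph = {
    "t": {"u1": 2}, "u1": {"t": 2},    # u1 is linked to a training user
    "u2": {},                          # u2 has no edges at all
    "u3": {"u4": 1}, "u4": {"u3": 1},  # test-only component
}
result = hybrid_lp_lr(
    graph,
    train_locations={"t": (10.0, 10.0)},
    text_predictions={"u1": (0.0, 0.0), "u2": (5.0, 5.0),
                      "u3": (1.0, 1.0), "u4": (3.0, 3.0)},
)
```

In the toy graph, u1 moves to its training neighbour, u2 keeps its
text-based estimate, and u3 and u4 settle on a shared location even
though neither is connected to a training user.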


5 Results 


Table 2 shows the performance of the three methods over the test set
for the three datasets. The results are also compared with the state
of the art for Twitter-US and Twitter-World (Wing and Baldridge,
2014), and GeoText (Ahmed et al., 2013).

Our methods achieve a sizeable improvement 
over the previous state of the art for all three 
datasets. LP-LR performs best over GeoText 
and Twitter-US, while LR performs best over 
Twitter-World; the reduction in median error 
distance over the state of the art ranges from around 
40% to over 95%. Even for Twitter-World, the
results for LP-LR are substantially better than the
best-published results for that dataset. 

Comparing LR and LP, no strong conclusion can be drawn: the text-based
LR actually outperforms the network-based LP for two of the three
datasets, but equally, the combination of the two (LP-LR) performs
better than either component method over two of the three datasets.
For the third (Twitter-World), LR outperforms LP-LR due to a
combination of factors. First, unlike the other two datasets, the
label set is pre-discretised (everything is aggregated at the city
level), meaning that LP and LR use the same label set.³ This annuls
the representational advantage that LP has in the case of the other
two datasets, in being able to capture a more fine-grained label set
(i.e., all locations associated with training users). Second, there
are substantially fewer disconnected test users in Twitter-World (see
Table 1), meaning that the results for the hybrid LP-LR method are
dominated by the empirically-inferior LP.

³ For consistency, we learn a k-d tree for Twitter-World and use the
merged representation for LR, but the k-d tree largely preserves the
pre-existing city boundaries.

Although LR is similar to Wing and Baldridge (2014), we achieved large
improvements over their reported results. This might be due to: (a)
our use of @-mention features; (b) ℓ1 regularisation, which is
essential to preventing overfitting for large feature sets; or (c) our
use of ℓ2 normalisation of rows in the design matrix, which we found
reduced errors by about 20% on GeoText, in keeping with results from
text categorisation (Lee, 1995). Preliminary experiments also showed
that lowering the term frequency threshold from 10 can further improve
the LR results on all three datasets. 
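
The feature weighting discussed here (a document-frequency threshold,
tf-idf weights, then per-user ℓ2 normalisation) can be illustrated in
plain Python. The sketch assumes raw term counts for tf and
log(N / df) for idf; the text does not state which tf-idf variant was
used, and the function name tfidf_l2 is hypothetical.

```python
import math
from collections import Counter


def tfidf_l2(documents, min_df=10):
    """Turn token lists (one per user) into unit-length tf-idf vectors.

    Terms occurring in fewer than min_df documents are dropped.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    vocab = {term for term, count in df.items() if count >= min_df}
    vectors = []
    for doc in documents:
        tf = Counter(term for term in doc if term in vocab)
        vec = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            # Each user's row is scaled to unit Euclidean length.
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


docs = [["a", "a", "b"], ["b", "c"], ["c", "c", "c", "a"]]
vectors = tfidf_l2(docs, min_df=1)
```

After normalisation a prolific user and an occasional user produce
vectors of the same length, so the classifier is not dominated by the
users with the most tweets.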

LP requires few hyper-parameters and is relatively
robust. It converged on all datasets in fewer
than 10 iterations, and geolocates not only the test 
users but all nodes in the mention graph. Another 
advantage of LP over LR is the relatively modest 
amount of memory and processing power it requires. 


6 Conclusion 


We proposed a series of approaches to social media user geolocation
based on: (1) text-based analysis using logistic regression with
regularisation; (2) network-based analysis using label propagation;
and (3) a hybrid method based on network-based label propagation, and
back-off to text-based analysis for disconnected users. We achieve
state-of-the-art results over three pre-existing Twitter datasets, and
find that, overall, the hybrid method is superior to the two component
methods.

The LP-LR method is a hybrid approach that uses the LR predictions as
priors. It is not simply a backoff from network information to textual
information, in the sense that it propagates the LR geolocations
through the network. That is, if a test node is disconnected from the
training nodes but still has connections to other test nodes, the
geolocation of the node is adjusted and propagated through the
network. It is possible to add extra nodes to the graph after applying
the algorithm and to geolocate only these nodes efficiently, although
this approach is potentially less accurate than running inference over
the full graph from scratch.

Label propagation algorithms such as Modified Adsorption (Talukdar and
Crammer, 2009) allow for different levels of influence between
prior/known labels and propagated label distributions. These
algorithms require a discretised output space for label propagation,
while LP can work directly on continuous data. We leave label
propagation over discretised output and allowing different influence
levels between prior and propagated label distributions to future
work.

There is no clear consensus on whether text- or network-based methods
are empirically superior at the user geolocation task. Our results
show that the network-based method (LP) is more robust than the
text-based (LR) method as it requires a smaller number of
hyper-parameters, uses less memory and computing resources, converges
much faster and geolocates not only test users but all mentioned
users. The drawback of LP is that it fails to geolocate disconnected
test users. So for connected nodes, which are the majority of test
nodes in all our datasets, LP is more robust than LR. Text-based
methods are very sensitive to the regularisation settings and the
types of textual features. That said, with thorough parameter tuning,
they might outperform network-based methods in terms of accuracy.

In future work, we hope to look at different types 
of network information for label propagation, more 
precise propagation methods to deal with non-local 
interactions, and also efficient ways of utilising both 
textual and network information in a joint model. 